Abstract



The model of a non-Bayesian agent who faces a repeated game with incomplete information
against Nature is an appropriate tool for modeling general agent-environment interactions.
In such a model the environment state (controlled by Nature) may change arbitrarily, and
the feedback/reward function is initially unknown. The agent is not Bayesian, that is, he
does not form a prior probability either on the state selection strategy of Nature or on
his reward function. A policy for the agent is a function which assigns an action to
every history of observations and actions. Two basic feedback structures are considered. 
In one of them - the perfect monitoring case - the agent is able to observe the previous 
environment state as part of his feedback, while in the other - the imperfect monitoring 
case - all that is available to the agent is the reward obtained. Both of these settings 
refer to partially observable processes, where the current environment state is unknown. 
Our main result refers to the competitive ratio criterion in the perfect monitoring case. 
We prove the existence of an efficient stochastic policy that ensures that the competitive 
ratio is obtained at almost all stages with an arbitrarily high probability, where efficiency 
is measured in terms of rate of convergence. It is further shown that such an optimal 
policy does not exist in the imperfect monitoring case. Moreover, it is proved that in the 
perfect monitoring case there does not exist a deterministic policy that satisfies our long 
run optimality criterion. In addition, we discuss the maxmin criterion and prove that a 
deterministic efficient optimal strategy does exist in the imperfect monitoring case under 
this criterion. Finally we show that our approach to long-run optimality can be viewed as 
qualitative, which distinguishes it from previous work in this area. 

1. Introduction 

Decision making is a central task of artificial agents (Russell & Norvig, 1995; Wellman, 
1985; Wellman & Doyle, 1992). At each point in time, an agent needs to select among 
several actions. This may be a simple decision, which takes place only once, or a more 
complicated decision where a series of simple decisions has to be made. The question of 
"what should the right actions be" is the basic issue discussed in both of these settings, and 
is of fundamental importance to the design of artificial agents. 

A static decision-making context (problem) for an artificial agent consists of a set of
actions that the agent may perform, a set of possible environment states, and a utility/reward
function which determines the feedback for the agent when it performs a particular action
in a particular state. Such a problem is best represented by a matrix with columns indexed
by the states, rows indexed by the actions and the rewards as entries. When the reward
function is not known to the agent we say that the agent has payoff uncertainty and we
refer to the problem as a problem with incomplete information (Fudenberg & Tirole, 1991).
When modeling a problem with incomplete information one must also describe the underly- 
ing assumptions on the knowledge of the agent about the reward function. For example, the 
agent may know bounds on his rewards, or he may know (or partially know) an underlying 
probabilistic structure 1 . In a dynamic (multistage) decision-making setup the agent faces 
static decision problems over stages. At each stage the agent selects an action to be per- 
formed and the environment selects a state. The history of actions and states determines 
the immediate reward as well as the next one-shot decision problem. The history of actions 
and states also determines the next selected state. Work on reinforcement learning in ar- 
tificial intelligence (Kaelbling, Littman, & Moore, 1996) has adopted the view of an agent 
operating in a probabilistic Bayesian setting, where the agent's last action and the last state 
determine the next environment state based on a given probability distribution. Naturally, 
the learner may not be a-priori familiar with this probability distribution, but the existence 
of the underlying probabilistic model is a key issue in the system's modeling. However, this 
assumption is not an ultimate one. In particular, much work in other areas of AI and in
economics has dealt with non-probabilistic settings in which the environment changes in
an unpredictable manner 2 . When the agent does not know the influence of his choices on
the selection of the next state (i.e., he is not certain about the environment strategy), we 
say that the agent has strategic uncertainty. 

In this paper we use a general model for the representation of agent-environment in- 
teractions in which the agent has both payoff and strategic uncertainty. We deal with a 
non-Bayesian agent who faces a repeated game with incomplete information against Nature. 

In a repeated game against Nature the agent faces the same static decision problem at 
each stage while the environment state is taken to be an action chosen by his opponents. 
The decision problem is called a game to stress the fact that the agent's action and the 
state are independently chosen. The fact that the game is repeated refers to the fact 
that the set of actions, the set of possible states, and the one shot utility function do not 
vary with time 3 . As we said, we consider an agent that has both payoff uncertainty and 
strategic uncertainty. That is, he is a-priori ignorant about the utility function (i.e., the 
game is of incomplete information) as well as about the state selection strategy of Nature. 
The agent is non-Bayesian in the sense that he does not assume any probabilistic model 
concerning nature's strategy and in the sense that he does not assume any probabilistic 
model concerning the reward function, though he may assume lower and upper bounds 4 . 
We consider two examples to illustrate the above-mentioned notions and model. Consider 

1. For example, the agent may know a probability distribution on a set of reward functions, he may assume 
that such a probability exists without any assumption on its structure, or he may have partial information 
about this distribution but be ignorant about some of its parameters (e.g., he may believe that the reward 
function is drawn according to a normal distribution with an unknown covariance matrix). 

2. There are many intermediate cases where it is assumed that the changes are probabilistic with a non- 
Markovian structure. 

3. In the most general setup, those sets may vary with time. No useful analysis can be done in a model 
where those changes are completely arbitrary. 

4. Repeated games with complete information, or more generally, multistage games and stochastic games 
have been extensively studied in game theory and economics. A very partial list includes: (Shapley, 
1953; Blackwell, 1956; Luce & Raiffa, 1957), and more recently (Fudenberg & Tirole, 1991; Mertens, 
Sorin, & Zamir, 1995), and the evolving literature on learning (e.g., Fudenberg & Levine 1997). The 
incomplete information setup in which the player is ignorant about the game being played was inspired
an investor, I, who is investing daily in a certain index of the stock market. His daily profit
depends on his action (selling or buying in a certain amount) and on the environment state
- the percentage change in the price of the index. This investor has complete information 
about the reward function because he knows the reward which is realized in a particular 
investment and a particular change, but he has strategic uncertainty about the changes 
in the index price. So, he is playing a repeated game with complete information against 
Nature with strategic uncertainty. 

Consider another investor, II, who invests in a particular mutual fund. This fund invests 
in the stock market with a strategy which is not known to the investor. Assume that each 
state represents the vector of percentage changes in the stocks, then the investor does not 
know his reward function. For example, he cannot say in advance what his profit would be
if he bought one unit of this fund and all stock prices increased by 1 percent. Thus, II
plays a repeated game with incomplete information. If in addition II does not attempt to 
construct a probabilistic model concerning his reward function or market behavior, then he 
is non-Bayesian and our analysis may apply to him. For another example, assume that Bob 
has to decide on each evening whether to prepare tea or coffee for his wife before she gets 
home. His wife wishes to drink either tea or coffee and he wishes to have it ready for her. 
The reaction of Bob's wife to tea or coffee may depend on her state that day, which can not 
be predicted based on the history of actions and states in previous days. As Bob has just 
got married he cannot tell what reward he will get if his wife is happy and he makes her 
a cup of tea. Of course he may eventually know it, but his decisions during this learning 
period are precisely the subject of this paper. 

As an example for the generality of the above-mentioned setting, consider the model of 
Markov decision processes with complete or incomplete information. In a Markov decision 
process an agent's action in a given state determines (in a probabilistic fashion) the next 
state to be obtained. That is, the agent has a structural assumption on the state selection 
strategy. A repeated game against Nature without added assumptions captures the fact 
that the transition from state to state may depend on the history in an arbitrary way. 

When the agent performs an action $a_t$ in state $s_t$, part of his feedback would be $u(a_t, s_t)$,
where $u$ is the reward function. We distinguish between two basic feedback structures. In
one of them - the perfect monitoring case - the agent is able to observe the previous
environment state as part of his feedback, while in the other - the imperfect monitoring
case - all that is available to the agent is the reward obtained 5 . Notice that in both of
these feedback structures, the current state is not observed by the agent when he is called
to select an action 6 . Both investors I and II face a repeated game with perfect monitoring
because the percentage changes become public knowledge after each iteration.

In the other example, when Bob has to make his decision, if the situation is of imperfect 
monitoring, Bob would be only able to observe the reward for his behavior (e.g., whether 

by (Harsanyi, 1967). See Aumann and Maschler (1995) for a comprehensive survey. Most of the above 
literature deals with (partially) Bayesian agents. Some of the rare exceptions are cited in Section 6. 

5. Notice that the former assumption is very popular in the related game theory literature (Aumann & 
Maschler, 1995). Many other intermediate monitoring structures may be interesting as well. 

6. Such is also the case in the evolving literature on the problem of controlling partially observable Markov 
decision processes (Lovejoy, 1991; Cassandra, Kaelbling, & Littman, 1994; Monahan, 1982). In contrast, 
Q-learning theory (Watkins, 1989; Watkins & Dayan, 1992; Sutton, 1992) does assume a knowledge of 
the current state.
she says "thanks", "that's terrible", "this is exactly what I wanted", etc.). In the perfect
monitoring case, Bob will be able to observe his wife's state (e.g., whether she came home 
happy, sad, nervous, etc.) in addition to his reward. In both cases Bob (like the investors) 
is not able to observe his wife's state before making his decision in a particular day. 

Consider the case of a one stage game against Nature, in which the utility function 
is known, but the agent cannot observe the current environment state when selecting his 
action. How should the agent choose his action? Work on decision making under uncertainty 
has suggested several approaches (Savage, 1972; Milnor, 1954; Luce & Raiffa, 1957; Kreps, 
1988). One of these approaches is the maxmin (safety-level) approach. According to this 
approach the agent would choose an action that maximizes his worst case payoff. Another 
approach is the competitive ratio approach (or its additive variant, termed the minmax
regret decision criterion (Milnor, 1954)). According to this approach an agent would choose
an action that minimizes the worst case ratio of the payoff he could have obtained
had he known the environment state to the payoff he would actually obtain. 7 Returning
to our example, if Bob had known the actual state of his wife, he could have chosen
an action that maximizes his payoff. Since he has no hint about her state, he can go ahead
and choose the action that minimizes his competitive ratio. For example, if this action leads
to a competitive ratio of two, then Bob can guarantee himself that the payoff he would get
is at least half the payoff he could have gotten had he known the actual state of his wife.
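
To make the difference between the two one-stage criteria concrete, the following minimal sketch evaluates both on a small payoff table. The numbers, the action names and the state names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
# Hypothetical payoffs for Bob: outer keys are actions, inner keys are
# his wife's possible states. All values are illustrative assumptions.
payoff = {
    "tea":    {"happy": 6, "tired": 2},
    "coffee": {"happy": 3, "tired": 3},
}

def maxmin_action(payoff):
    """Action that maximizes the worst-case payoff (safety level)."""
    return max(payoff, key=lambda a: min(payoff[a].values()))

def cr_action(payoff):
    """Action that minimizes the worst-case ratio best(s) / payoff(a, s)."""
    states = next(iter(payoff.values())).keys()
    best = {s: max(payoff[a][s] for a in payoff) for s in states}
    worst_ratio = {a: max(best[s] / payoff[a][s] for s in states) for a in payoff}
    chosen = min(worst_ratio, key=worst_ratio.get)
    return chosen, worst_ratio[chosen]

print(maxmin_action(payoff))  # the safety-level choice
print(cr_action(payoff))      # the competitive ratio choice and its ratio
```

On this table the two criteria disagree: the safety-level choice protects the lowest payoff, while the competitive ratio choice protects the fraction of the best attainable payoff.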

Given a repeated game with incomplete information against Nature, the agent would 
not be able to choose his one stage optimal action (with respect to the competitive ratio or 
maxmin value criteria) at each stage, since the utility function is initially unknown. So, if 
Bob does not initially know the reward he would receive for his actions as a function of his 
wife's state, then he will not be able to simply choose an action that guarantees the best 
competitive ratio. This calls for a precise definition of a long-run optimality criterion. In 
this paper we are mainly concerned with policies (strategies) guaranteeing that the optimal 
competitive ratio (or the maxmin value) is obtained in most stages. We are interested in 
particular in efficient policies, where efficiency is measured in terms of rate of convergence. 
Hence in Bob's case, we are interested in a policy that if adopted by Bob would guarantee 
him on almost any day, with high probability, at least the payoff guaranteed by an action 
leading to the competitive ratio. Moreover, Bob will not have to wait much before he will 
start getting this type of satisfactory behavior. 

In Section 2 we define the above mentioned setting and introduce some preliminaries. In 
Sections 3 and 4 we discuss the long-run competitive ratio criterion: In Section 3 we show 
that even in the perfect monitoring case, a deterministic optimal policy does not exist. 
However, we show that there exists an efficient stochastic policy which ensures that the 
long-run competitive ratio criterion holds with a high probability. In Section 4 we show 
that such stochastic policies do not exist in the imperfect monitoring case. In Section 5 we 
prove that for both the perfect and imperfect monitoring cases there exists a deterministic 
efficient optimal policy for the long-run maxmin criterion. In Section 6 we compare our 
notions of long-run optimality to other criteria appearing in some of the related literature. 
In particular, we show that our approach to long-run optimality can be interpreted as 



7. The competitive ratio decision criterion has been found to be most useful in settings such as on-line
algorithms (e.g., Papadimitriou & Yannakakis, 1989).
qualitative, which distinguishes it from previous work in this area. We also discuss some of
the connections of our work with work in reinforcement learning. 

2. Preliminaries 

A (one-shot) decision problem (with payoff certainty and strategic uncertainty) is a 3-tuple
$D = \langle A, S, u \rangle$, where $A$ and $S$ are finite sets and $u$ is a real-valued function defined on
$A \times S$ with $u(a, s) > 0$ for every $(a, s) \in A \times S$. Elements of $A$ are called actions and those
of $S$ are called states; $u$ is called the utility function. The interpretation of the numerical
values $u(a,s)$ is context-dependent 8 . Let $n_A$ denote the number of actions in $A$, let $n_S$
denote the number of states in $S$ and let $n = \max(n_A, n_S)$.

The above-mentioned setting is a classical static setting for decision making, where 
there is uncertainty about the actual state of nature (Luce & Raiffa, 1957). In this paper 
we deal with a dynamic setup, in which the agent faces the decision problem D, without 
knowing the utility function u, over an infinite number of stages, t = 1,2,.... As we 
have explained in the introduction, this setting enables us to capture general dynamic non- 
Bayesian decision-making contexts, where the environment may change its behavior in an 
arbitrary and unpredictable fashion. As mentioned in the introduction, this is best captured 
by means of a repeated game against Nature. The state of the environment at each point 
plays the role of an action taken by Nature in the corresponding game. The agent knows 
the sets $A$ and $S$, but he does not know the payoff function $u$. 9 A dynamic decision problem
(with payoff uncertainty and strategic uncertainty) is therefore represented for the agent
by a pair $DD = \langle A, S \rangle$ of finite sets 10 . At each stage $t$, Nature chooses a state $s_t \in S$.
The agent, who does not know the chosen state, chooses $a_t \in A$, and receives $u(a_t, s_t)$. We
distinguish between two informational structures. In the perfect monitoring case, the state
$s_t$ is revealed to the agent alongside the payoff $u(a_t, s_t)$. In the imperfect monitoring case,
the states are not revealed to the agent. A generic history available to the agent at stage $t+1$
is denoted by $h_t$. In the perfect monitoring case, $h_t \in H^p_t = (A \times S \times R_+)^t$, where $R_+$ denotes
the set of positive real numbers. In the imperfect monitoring case, $h_t \in H^{imp}_t = (A \times R_+)^t$.
In the particular case $t = 0$ we assume that $H^p_0 = H^{imp}_0 = \{e\}$ is a singleton containing
the empty history $e$. Let $H^p = \bigcup_{t \ge 0} H^p_t$ and let $H^{imp} = \bigcup_{t \ge 0} H^{imp}_t$. The symbol $H$ will be
used for both $H^p$ and $H^{imp}$. A strategy 11 for the agent in a dynamic decision problem is
a function $F : H \to \Delta(A)$, where $\Delta(A)$ denotes the set of probability measures over $A$.
That is, for every $h_t \in H$, $F(h_t) : A \to [0, 1]$ and $\sum_{a \in A} F(h_t)(a) = 1$. In other words, if
the agent observes the history $h_t$ then he chooses $a_{t+1}$ by randomizing amongst his actions,
with the probability $F(h_t)(a)$ assigned to the action $a$. A strategy $F$ is called pure if $F(h_t)$
is a probability measure concentrated on a singleton for every $t \ge 0$.

In Sections 2-4 the strategy recommended to the agent is chosen according to a "long- 
run" decision criterion which is induced by the competitive ratio one-stage decision criterion. 

8. See the discussion at Section 6. 

9. All the results of this paper remain unchanged if the agent does not initially know the set $S$, but rather
an upper bound on $n_S$.

10. Notice that there is no need to include an explicit transition function in this representation. This is due 
to the fact that in the non-Bayesian setup every transition is possible. 

11. Strategy is a decision theoretic concept. It coincides with the term policy used in the control theory 
literature, and with the term protocol used in the distributed systems literature.
The competitive ratio decision criterion, that is described below, may be used by an agent
who faces the decision problem only once, and who knows the payoff function u as well as 
the sets A and S. There are other "reasonable" decision criteria that could be used instead. 
One of them is the maxmin decision criterion to be discussed in Section 5, while another 
is the minmax regret decision criterion (Luce & Raiffa, 1957; Milnor, 1954). The latter is 
a simple variant of the competitive ratio (and can be treated similarly), and therefore will 
not be treated explicitly in this paper. 

For every $s \in S$ let $M(s)$ be the maximal payoff the agent can get when the state is $s$.
That is

$$M(s) = \max_{a \in A} u(a, s).$$

For every $a \in A$ and $s \in S$ define

$$c(a,s) = \frac{M(s)}{u(a, s)}.$$

Denote $c(a) = \max_{s \in S} c(a, s)$, and let

$$CR = \min_{a \in A} c(a) = \min_{a \in A} \left( \max_{s \in S} c(a, s) \right).$$

$CR$ is called the competitive ratio of $D = \langle A, S, u \rangle$. Any action $a$ for which $CR = c(a)$
is called a competitive ratio action, or in short a CR action. An agent which chooses a
CR action guarantees receiving at least a fraction $\frac{1}{CR}$ of what it could have gotten, had it
known the state $s$. That is, $u(a, s) \ge \frac{1}{CR} M(s)$ for every $s \in S$. This agent cannot guarantee
a bigger fraction.
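
The definitions of $M(s)$, $c(a,s)$ and $CR$ translate directly into a few lines of code. The sketch below is only an illustration: the function name `competitive_ratio` and the example matrix are assumptions made here, and payoffs are assumed to be positive integers so that exact rational arithmetic can be used.

```python
from fractions import Fraction

def competitive_ratio(u):
    """Return (CR, list of CR actions) for a payoff matrix u[a][s].

    Rows are actions, columns are states; entries are assumed to be
    positive integers so that Fraction gives exact ratios.
    """
    n_actions, n_states = len(u), len(u[0])
    # M(s): the maximal payoff attainable when the state is s.
    best = [max(u[a][s] for a in range(n_actions)) for s in range(n_states)]
    # c(a): the worst-case ratio M(s) / u(a, s) over all states s.
    c = [max(Fraction(best[s], u[a][s]) for s in range(n_states))
         for a in range(n_actions)]
    cr = min(c)
    return cr, [a for a in range(n_actions) if c[a] == cr]

# Hypothetical 2x2 decision problem (not taken from the paper).
u = [[4, 1],
     [3, 2]]
cr, cr_actions = competitive_ratio(u)
print(cr, cr_actions)

# The guarantee u(a, s) >= M(s) / CR holds for every state s.
a = cr_actions[0]
assert all(u[a][s] >= Fraction(max(row[s] for row in u)) / cr for s in range(2))
```

In this example the second action is the unique CR action: it never earns less than three quarters of what the best action for the realized state would have earned.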

In the long-run decision problem, a non-Bayesian agent does not form a prior probability 
on the way Nature is choosing the states. Nature may choose a fixed sequence of states or, 
more generally, use a probabilistic strategy $G$, where $G : Q \to \Delta(S)$, and $Q = \bigcup_{t \ge 0} Q_t = \bigcup_{t \ge 0} (A \times S)^t$.
Nature can be viewed as a second player that knows the reward function. Its
strategy may of course refer to the whole history of actions and states until a given point 
and may depend on the payoff function. 

A payoff function $u$ and a pair of probabilistic strategies $F$, $G$, where $G$ can depend on $u$,
generate a probability measure $\mu = \mu_{F,G,u}$ over the set of infinite histories $Q_\infty = (A \times S)^\infty$
endowed with the natural measurable structure. For an event $B \subseteq Q_\infty$ we will denote the
probability of $B$ according to $\mu$ by $\mu(B)$ or by $\mathrm{Prob}_\mu(B)$. More precisely, the probability
measure $\mu$ is uniquely defined by its values for finite cylinder sets: Let $A_t : Q_\infty \to A$ and
$S_t : Q_\infty \to S$ be the coordinate random variables which contain the values of the actions
and states selected by the agent and the environment in stage $t$ (respectively). That is,
$A_t(h) = a_t$ and $S_t(h) = s_t$ for every $h = ((a_1, s_1), (a_2, s_2), \ldots)$ in $Q_\infty$. Then for every $T \ge 1$
and for every $((a_1, s_1), \ldots, (a_T, s_T)) \in Q_T$,

$$\mathrm{Prob}_\mu\left((A_t, S_t) = (a_t, s_t) \text{ for all } 1 \le t \le T\right) = \prod_{t=1}^{T} F(\varphi_{t-1})(a_t)\, G(\psi_{t-1})(s_t),$$

where $\psi_0$ and $\varphi_0$ are the empty histories, and for $2 \le t \le T$ we have

$$\psi_{t-1} = ((a_1, s_1), \ldots, (a_{t-1}, s_{t-1})),$$

while the definition of $\varphi_{t-1}$ depends on the monitoring structure. In the perfect monitoring
case,

$$\varphi_{t-1} = ((a_1, s_1, u(a_1, s_1)), \ldots, (a_{t-1}, s_{t-1}, u(a_{t-1}, s_{t-1}))),$$

and in the imperfect monitoring case

$$\varphi_{t-1} = ((a_1, u(a_1, s_1)), \ldots, (a_{t-1}, u(a_{t-1}, s_{t-1}))).$$

We now define some auxiliary additional random variables on $Q_\infty$.

Let $X_t = 1$ if $c(A_t, S_t) \le CR$ and $X_t = 0$ otherwise, and let $N_T = \sum_{t=1}^{T} X_t$. 12 Let $\delta > 0$.
A strategy $F$ is $\delta$-optimal if there exists an integer $K$ such that for every payoff function $u$
and every Nature's strategy $G$

$$\mathrm{Prob}_\mu\left(N_T \ge (1 - \delta)T \text{ for every } T \ge K\right) \ge 1 - \delta \qquad (1)$$

where $\mu = \mu_{F,G,u}$. A strategy $F$ is optimal if it is $\delta$-optimal for all $\delta > 0$.

Roughly speaking, $N_T$ measures the number of stages in which the competitive ratio (or
a better value) has been obtained in the first $T$ iterations. In a $\delta$-optimal strategy there
exists a number $K$, such that if the system runs for $T \ge K$ iterations we can get with high
probability that $N_T / T$ is close to 1 (i.e., almost all iterations are good ones). In an optimal
strategy we guarantee that we can get as close as we wish to the situation where all iterations 
are good ones, with a probability that is as high as we wish. Notice that we require that the 
above-mentioned useful property will hold for every payoff function and for every strategy 
of Nature. This strong requirement is a consequence of the non-Bayesian setup; since we do 
not have any clue about the reward function or about the strategy selected by Nature (and 
this strategy may yield arbitrary sequences of states to be reached), the best policy would 
be to insist on good behavior against any behavior adopted by Nature. Notice however that 
two other relaxations are introduced here; we require successful behavior in most stages, 
and that the whole process would be successful only with some (very high) probability. 
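
The event inside criterion (1) can be checked on a single realized path, which may help in reading the definition. The sketch below is an illustration under stated assumptions: the function names are invented here, the horizon is finite (the criterion itself quantifies over all $T \ge K$), and the input is the list of realized ratios $c(A_t, S_t)$. Criterion (1) additionally requires this event to have probability at least $1 - \delta$, which a single path cannot establish.

```python
def good_stage_counts(ratios, cr):
    """Running counts N_1, ..., N_T for realized ratios c(A_t, S_t)."""
    counts, n = [], 0
    for c in ratios:
        n += 1 if c <= cr else 0  # X_t = 1 when the stage ratio is at most CR
        counts.append(n)
    return counts

def path_satisfies_event(ratios, cr, delta, K):
    """Check N_T >= (1 - delta) * T for every K <= T <= len(ratios)."""
    counts = good_stage_counts(ratios, cr)
    return all(counts[T - 1] >= (1 - delta) * T
               for T in range(K, len(counts) + 1))

# Hypothetical path: one bad first stage followed by nine good stages.
ratios = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(path_satisfies_event(ratios, cr=2, delta=0.25, K=4))
print(path_satisfies_event(ratios, cr=2, delta=0.25, K=3))
```

The example shows the role of $K$: a single early bad stage is tolerated once enough stages have passed, but not before.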

The major objective is to find a policy that will enable (1) to hold for every dynamic 
decision problem and every Nature strategy. Moreover, we wish (1) to hold for small enough 
K. If K is small then our agent can benefit from obtaining the desired behavior already in 
an early stage. 13 This will be the subject of the following section. We complete this section
with a useful technical observation. We show that a strategy $F$ is $\delta$-optimal if it satisfies
the optimality criterion (1) for every reward function and for every stationary strategy of
Nature, where a stationary strategy is defined by a sequence of states $z = (s_t)_{t=1}^{\infty}$. In this
strategy Nature chooses $s_t$ at stage $t$, independent of the history. Indeed, assume that $F$ is
a strategy for which (1) holds for every reward function and for every stationary strategy
of Nature; we then show that $F$ is $\delta$-optimal.

Given any payoff function $u$ and any strategy $G$, $\delta$-optimality with respect to stationary
strategies implies that for $\mu = \mu_{F,G,u}$,

$$\mathrm{Prob}_\mu\left(N_T \ge (1 - \delta)T \text{ for every } T \ge K \mid S_1, S_2, \ldots\right) \ge 1 - \delta,$$

12. Note that the function $c(a, s)$ depends on the payoff function $u$ and therefore so do the random variables
$X_t$ and $N_T$.

13. The interested reader may wish to think of our long-run optimality criteria in view of the original work 
on PAC learning (Valiant, 1984). In our setting, as in PAC learning, we wish to obtain desired behavior, 
in most situations, with high probability, and relatively fast.
with probability one. Therefore

$$\mathrm{Prob}_\mu\left(N_T \ge (1 - \delta)T \text{ for every } T \ge K\right) \ge 1 - \delta.$$

Roughly speaking, the above captures the fact that in our non-Bayesian setting we 
need to present a strategy that will be good for any sequence of states chosen by Nature, 
regardless of the way in which it has been chosen. 

3. Perfect Monitoring 

In this section we present our main result. This result refers to the case of perfect monitoring,
and shows the existence of a $\delta$-optimal strategy in this case. It also guarantees that the
desired behavior will be obtained after polynomially many stages. Our result is constructive.
We first present the rough idea of the strategy employed in our proof. If the utility
function were known to the agent then he could use the competitive ratio action. Since the
utility function is initially unknown, the agent will use a greedy strategy, where he
selects an action that is optimal as far as the competitive ratio is concerned, according to
the agent's knowledge at the given point. However, the agent will try from time to time to
sample a random action. 14 Our strategy chooses the right tradeoff between these exploration
and exploitation phases in order to yield the desired result. Some careful analysis is needed
in order to prove the optimality of the related strategy, and the fact that it yields the desired
result after polynomially many stages.
We now introduce our main theorem. 

Theorem 3.1: Let $DD = \langle A, S \rangle$ be a dynamic decision problem with perfect monitoring.
Then for every $\delta > 0$ there exists a $\delta$-optimal strategy. Moreover, the $\delta$-optimal strategy
can be chosen to be efficient in the sense that $K$ (in (1)) can be taken to be polynomial in
$\max(n, \frac{1}{\delta})$.

Proof: Recall that $n_A$ and $n_S$ denote the number of actions and states respectively, and
that $n = \max(n_A, n_S)$. In this proof we assume for simplicity that $n = n_A = n_S$. Only
slight modifications are required for removing this assumption. Without loss of generality,
$\delta < 1$. We define a strategy $F$ as follows: Let $M = \frac{8}{\delta}$. That is,

$$\frac{1}{M} = \frac{\delta}{8}.$$

At each stage $T \ge 1$ we will construct matrices $U_T^F$, $C_T^F$ and a subset of the actions $W_T$ in
the following way: Define $U_1^F(a, s) = *$ for each $a, s$. At each stage $T > 1$, if $a_{T-1}$ has been
performed in stage $T - 1$, and $s_{T-1}$ has been observed, then update $U$ by replacing the $*$
in the $(a_{T-1}, s_{T-1})$ entry with $u(a_{T-1}, s_{T-1})$. At each stage $T \ge 1$, if $U_T^F(a,s) = *$, define
$C_T^F(a,s) = 1$. If $U_T^F(a,s) \ne *$,

$$C_T^F(a,s) = \max_{\{b \,:\, U_T^F(b,s) \ne *\}} \frac{U_T^F(b,s)}{U_T^F(a,s)}.$$

Finally $W_T$ is the set of all $a \in A$ at which $\min_{b \in A} \max_{s \in S} C_T^F(b, s)$ is obtained. We refer to
elements in $W_T$ as the temporarily good actions at stage $T$. Let $(Z_t)_{t \ge 1}$ be a sequence of i.i.d. $\{0, 1\}$ random

14. We use a uniform probability distribution to select among actions in the exploration phase. Our result
can be obtained with different exploration techniques as well.
variables with $\mathrm{Prob}(Z_t = 1) = 1 - \frac{1}{M}$. This sequence is generated as part of the strategy,
independent of the observed history. That is, at each stage, before choosing an action, the
agent flips a coin, independently of his past observations. At each stage $t$ the agent observes
$Z_t$. If $Z_t = 1$, the agent chooses an action from $W_t$ by randomizing with equal probabilities.
If $Z_t = 0$ the agent randomizes with equal probabilities amongst the actions in $A$. This
completes the description of the strategy $F$. Let $u$ be a given payoff function, and let $(s_t)_{t=1}^{\infty}$
be a given sequence of states. We proceed to show that (1) holds with $K$ being the upper
integer value of $\alpha = \max(\alpha_1 + 2, \alpha_2 + 2)$, where



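$$\alpha_1 = \frac{128}{\delta^2} \ln\left(\frac{256}{\delta^3}\right) \quad \text{and} \quad \alpha_2 = \ldots\, n^2 \left(n \ldots + 1\right) \ln \ldots$$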

Recall that $X_t = 1$ if $c(A_t, s_t) \le CR$ and $X_t = 0$ otherwise, and that $N_T = \sum_{t=1}^{T} X_t$. By
a slight change of notation, we denote by $P_\mu = \mathrm{Prob}_\mu$ the probability measure induced by
$F$, $u$ and the sequence of states on $(A \times S \times \{0, 1\})^\infty$ (where $\{0, 1\}$ corresponds to the $Z_t$
values).



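Let $\varepsilon = \frac{\delta}{8}$. Define

$$B_K = \left\{ \sum_{t=1}^{T} Z_t \ge \left(1 - \frac{1}{M} - \varepsilon\right) T \ \text{ for all } T \ge K \right\}.$$

Roughly speaking, $B_K$ captures the cases where temporarily good actions are selected
in most stages.

By (Chernoff, 1952) (see also (Alon, Spencer, & Erdos, 1992)), for every $T$,

$$P_\mu\left( \sum_{t=1}^{T} Z_t < \left(1 - \frac{1}{M} - \varepsilon\right) T \right) \le e^{-\varepsilon^2 T / 2}.$$

Recall that given a set $S$, $\overline{S}$ denotes the complement of $S$.
Hence,

$$P_\mu(\overline{B_K}) \le \sum_{T=K}^{\infty} P_\mu\left( \sum_{t=1}^{T} Z_t < \left(1 - \frac{1}{M} - \varepsilon\right) T \right) \le \sum_{T=K}^{\infty} e^{-\varepsilon^2 T / 2}.$$

Therefore

$$P_\mu(\overline{B_K}) \le \int_{K-1}^{\infty} e^{-\varepsilon^2 T / 2} \, dT = \frac{2}{\varepsilon^2} e^{-\varepsilon^2 (K-1) / 2}.$$

Since $K > \alpha_1$ (indeed $K \ge \alpha_1 + 2$),

$$P_\mu(\overline{B_K}) \le \frac{\delta}{2}. \qquad (2)$$

Define:

$$L_K = \{N_T \ge (1 - \delta)T \text{ for every } T \ge K\}.$$

Roughly speaking, $L_K$ captures the cases where competitive ratio actions (or better
actions in this regard) are selected in most stages.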

In order to prove that $F$ is $\delta$-optimal (i.e., that (1) is satisfied), we have to prove that

$$P_\mu(\overline{L_K}) \le \delta. \qquad (3)$$

By (2) it suffices to prove that

$$P_\mu(\overline{L_K} \mid B_K) \le \frac{\delta}{2}. \qquad (4)$$

We now define for every $t \ge 1$, $T \ge 1$, $s \in S$ and $a \in A$ six auxiliary random variables, $Y_t$, $R_T$, $Y_t^s$, $R_T^s$, $Y_t^{s,a}$, $R_T^{s,a}$.
Let $Y_t = 1$ whenever $Z_t = 1$ and $X_t = 0$, and $Y_t = 0$ otherwise. Let

$$R_T = \sum_{t=1}^{T} Y_t.$$

For every $s \in S$ let $Y_t^s = 1$ whenever $Y_t = 1$ and $s_t = s$, and $Y_t^s = 0$ otherwise. Let

$$R_T^s = \sum_{t=1}^{T} Y_t^s.$$

For every $s \in S$ and for every $a \in A$, let $Y_t^{s,a} = 1$ whenever $Y_t^s = 1$ and $A_t = a$, and
$Y_t^{s,a} = 0$ otherwise. Let

$$R_T^{s,a} = \sum_{t=1}^{T} Y_t^{s,a}.$$

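Let $g$ be the integer value of $\frac{3}{4}\delta K$. We now show that

$$P_\mu(\overline{L_K} \mid B_K) \le P_\mu(\exists T \ge K,\ R_T \ge g \mid B_K). \qquad (5)$$

In order to prove (5) we show that

$$\overline{L_K} \cap B_K \subseteq \{\exists T \ge K,\ R_T \ge g\} \cap B_K.$$

Indeed, if $w$ is a path in $B_K$ such that $R_T < g$ for every $T \ge K$, then, at $w$, for every
$T \ge K$,

$$N_T \ge \sum_{1 \le t \le T,\ Z_t = 1} X_t \ge V_T - \sum_{t=1}^{T} Y_t,$$

where $V_T$ denotes the number of stages $1 \le t \le T$ for which $Z_t = 1$. Since $w \in B_K$,

$$N_T \ge \left(1 - \frac{1}{M} - \varepsilon\right) T - R_T \ge \left(1 - \frac{1}{M} - \varepsilon\right) T - g$$

for every $T \ge K$. Since $\frac{1}{M} = \varepsilon = \frac{\delta}{8}$ and $g \le \frac{3}{4}\delta K$, $N_T \ge (1 - \delta)T$ for every $T \ge K$. Hence,
$w \in L_K$.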

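(5) implies that it suffices to prove that

$$P_\mu(\exists T \ge K,\ R_T \ge g \mid B_K) \le \frac{\delta}{2}. \qquad (6)$$

Therefore it suffices to prove that for every $s \in S$,

$$P_\mu\left(\exists T \ge K,\ R_T^s \ge \frac{g}{n} \,\Big|\, B_K\right) \le \frac{\delta}{2n}.$$

Hence it suffices to prove that for every $s \in S$ and every $a \in A$,

$$J = P_\mu\left(\exists T \ge K,\ R_T^{s,a} \ge \frac{g}{n^2} \,\Big|\, B_K\right) \le \frac{\delta}{2n^2}. \qquad (7)$$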

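In order to prove (7), note that if the inequality $R_T^{s,a} \ge \frac{g}{n^2}$ is satisfied at $w$, then
$c(a, s) > CR$ and $a$ is nevertheless considered to be a good action in at least $\frac{g}{n^2}$ stages
$1 \le t \le T$ (w.l.o.g. assume that $\frac{g}{n^2}$ is an integer). Let $b \in A$ satisfy $\frac{u(b,s)}{u(a,s)} > CR$. If $b$ is
ever played in a stage $\bar{t}$ with $s_{\bar{t}} = s$, then $a \notin W_t$ for all $t > \bar{t}$. Therefore

$$J \le P_\mu\left(\exists T \ge K,\ b \text{ is not played in the first } \tfrac{g}{n^2} \text{ stages at which } s_t = s \,\Big|\, B_K\right).$$

Hence

$$J \le \left(1 - \frac{1}{nM}\right)^{g / n^2}.$$

As $\left(1 - \frac{1}{x}\right)^{x} \le e^{-1}$ for $x \ge 1$,

$$J \le \frac{\delta}{2n^2}.$$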



□ 
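The last step of the proof rests on the elementary bound $(1 - 1/x)^x \le e^{-1}$ for $x \ge 1$, which gives $(1 - 1/x)^k \le e^{-k/x}$ for every $k \ge 0$. The following Python snippet is only a numerical sanity check of that bound; the function names and the values of `x` and `k` are illustrative assumptions and are not constants taken from the proof.

```python
import math

def miss_probability(x: float, k: float) -> float:
    """Probability that an event of per-trial probability 1/x never occurs in k independent trials."""
    return (1.0 - 1.0 / x) ** k

def exponential_bound(x: float, k: float) -> float:
    """Upper bound e^(-k/x) that follows from (1 - 1/x)^x <= e^(-1)."""
    return math.exp(-k / x)

# Hypothetical values: x plays the role of the inverse per-stage probability,
# k the number of stages considered.
for x in (1.0, 2.0, 10.0, 400.0):
    for k in (1, 50, 5000):
        assert miss_probability(x, k) <= exponential_bound(x, k)
```

The bound decays exponentially in $k/x$, which is why a number of stages polynomial in the problem parameters is enough to push the failure probability below a prescribed level.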



Theorem 3.1 shows that efficient dynamic non-Bayesian decisions may be obtained by
an appropriate stochastic policy. Moreover, it shows that $\delta$-optimality can be obtained in
time which is a (low degree) polynomial in $\max(n, \frac{1}{\delta})$. An interesting question is whether
similar results can be obtained by a pure/deterministic strategy. As the following example
shows, deterministic strategies do not suffice for that job.

We give an example in which the agent does not have a $\delta$-optimal pure strategy.



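Example 1: Let $A = \{a^1, a^2\}$ and $S = \{s^1, s^2\}$. Assume in negation that the agent has a
$\delta$-optimal pure strategy $f$.

Consider the following two decision problems whose rows are indexed by the actions and
whose columns are indexed by the states:

$$D_1 = \begin{pmatrix} 1 & 10 \\ 30 & 2 \end{pmatrix}, \qquad D_2 = \begin{pmatrix} 1 & 30 \\ 10 & 2 \end{pmatrix},$$

with the corresponding ratio matrices:

$$C_1 = \begin{pmatrix} 30 & 1 \\ 1 & 5 \end{pmatrix}, \qquad C_2 = \begin{pmatrix} 10 & 1 \\ 1 & 15 \end{pmatrix}.$$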
Assume in addition that in both cases Nature uses the strategy $g$, defined as follows:
$g(h_t) = s^i$ if $f(h_t) = a^i$, $i = 1, 2$. That is, for every $t$, $(a_t, z_t) = (a^1, s^1)$ or $(a_t, z_t) = (a^2, s^2)$,
where $a_t$ and $z_t$ denote the action and state selected at stage $t$, respectively. Let $\delta < 0.25$.
Let $N_T^i$ denote $N_T$ for decision problem $i$. Since $f$ is $\delta$-optimal, there exists $K$ such that
for every $T \ge K$, $N_T^1 \ge (1 - \delta)T$ and $N_T^2 \ge (1 - \delta)T$. Note also that the same sequence
$((a_t, z_t))_{t \ge 1}$ is generated in both cases, since the payoffs observed along it, $u(a^1, s^1) = 1$ and
$u(a^2, s^2) = 2$, coincide in $D_1$ and $D_2$. $N_K^1 \ge (1 - \delta)K$ implies that $(a^2, s^2)$ is used at more
than half of the stages $1, 2, \ldots, K$. On the other hand, $N_K^2 \ge (1 - \delta)K$ implies that $(a^1, s^1)$
is used at more than half of the stages $1, 2, \ldots, K$. A contradiction.
□ 
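The ratio matrices of Example 1 can be recomputed mechanically. The sketch below assumes the definitions $c(a, s) = \max_b u(b, s)/u(a, s)$ and $CR = \min_a \max_s c(a, s)$; the function names are hypothetical and the code is only an illustration of why the two diagonal outcomes produced by Nature's strategy are judged in opposite ways in $D_1$ and $D_2$.

```python
from fractions import Fraction

def ratio_matrix(u):
    """c(a, s) = max_b u(b, s) / u(a, s) for a payoff matrix u[action][state]."""
    n_actions, n_states = len(u), len(u[0])
    best = [max(u[b][s] for b in range(n_actions)) for s in range(n_states)]
    return [[Fraction(best[s], u[a][s]) for s in range(n_states)]
            for a in range(n_actions)]

def competitive_ratio(c):
    """CR = min over actions of the worst-case (max over states) ratio."""
    return min(max(row) for row in c)

D1 = [[1, 10], [30, 2]]
D2 = [[1, 30], [10, 2]]
C1, C2 = ratio_matrix(D1), ratio_matrix(D2)
CR1, CR2 = competitive_ratio(C1), competitive_ratio(C2)

# Nature's strategy only ever produces the diagonal outcomes (a^1, s^1) and (a^2, s^2).
# A stage counts towards N_T when c(a_t, z_t) <= CR.
good_in_D1 = [C1[i][i] <= CR1 for i in range(2)]
good_in_D2 = [C2[i][i] <= CR2 for i in range(2)]
```

Under these assumptions only $(a^2, s^2)$ counts as a good stage in $D_1$ and only $(a^1, s^1)$ counts in $D_2$, so no single sequence of diagonal outcomes can be good at more than half of the stages in both problems.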

For analytical completeness, we end this section by proving the existence of an optimal
strategy (and not merely a $\delta$-optimal strategy). Such an optimal strategy is obtained by
utilizing $\delta_m$-optimal strategies (whose existence was proved in Theorem 3.1) for intervals of
stages with sizes that converge to infinity, when $\delta_m \to 0$.

Corollary 3.2: In every dynamic decision problem with perfect monitoring there exists 
an optimal strategy. 

Proof: For $m \ge 1$, let $F_m$ be a $\delta_m$-optimal strategy, where $(\delta_m)_{m=1}^{\infty}$ is a decreasing
sequence with $\lim_{m \to \infty} \delta_m = 0$. Let $(K_m)_{m=1}^{\infty}$ be an increasing sequence of integers such
that for every $m \ge 1$

$$\mathrm{Prob}\left(N_T \ge (1 - \delta_m)T \text{ for every } T \ge K_m\right) \ge 1 - \delta_m,$$

and 

V m K ■ 

Am + l > * 7 • 

Let $F$ be the strategy that for $m \ge 1$ utilizes $F_m$ at the stages $K_0 + \ldots + K_{m-1} + 1 \le t \le
K_0 + \ldots + K_{m-1} + K_m$, where $K_0 = 0$. It can be easily verified that $F$ is optimal.

□ 
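The block structure used in the proof of Corollary 3.2 can be sketched as a lookup from a stage $t$ to the index $m$ of the strategy $F_m$ in force at that stage. The function name and the block lengths below are hypothetical placeholders; in the proof the $K_m$ must additionally satisfy the conditions stated there.

```python
from bisect import bisect_left
from itertools import accumulate

def block_of_stage(t, block_lengths):
    """Return m >= 1 such that stage t lies in the block where F_m is used.

    block_lengths[m - 1] is K_m; block m covers the stages
    K_1 + ... + K_{m-1} + 1 through K_1 + ... + K_m.
    """
    ends = list(accumulate(block_lengths))  # last stage of each block
    if t < 1 or t > ends[-1]:
        raise ValueError("stage outside the listed blocks")
    return bisect_left(ends, t) + 1

# Hypothetical increasing block lengths K_1, K_2, K_3.
K = [3, 10, 40]
schedule = [block_of_stage(t, K) for t in (1, 3, 4, 13, 14, 53)]
```

With these placeholder lengths the first block covers stages 1 to 3, the second covers stages 4 to 13, and the third covers stages 14 to 53.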



4. Imperfect Monitoring 

We proceed to give an example for the imperfect monitoring case, where for sufficiently
small $\delta > 0$, the agent does not have a $\delta$-optimal strategy.



Example 2 (Non-existence of $\delta$-optimal strategies in the imperfect monitoring case)

Let $A = \{a^1, a^2\}$, and $S = \{s^1, s^2, s^3\}$. Let $\delta < \delta_0$, where $\delta_0$ is defined at the end
of this proof. Assume in negation that there exists a $\delta$-optimal strategy $F$. Consider the
following two decision problems whose rows are indexed by the actions and whose columns
are indexed by the states:

$$D_1 = \begin{pmatrix} 2a & 2b & 2c \\ a & b & c \end{pmatrix}, \qquad D_2 = \begin{pmatrix} 2a & 2b & 2c \\ b & c & a \end{pmatrix},$$

where $a$, $b$ and $c$ are positive numbers satisfying $a > 4b > 16c$. For $i = 1, 2$, let
$C_i = (c_i(a, s))_{a \in A, s \in S}$ be the ratio matrices. That is:

$$C_1 = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \end{pmatrix}, \qquad C_2 = \begin{pmatrix} 1 & 1 & \frac{a}{2c} \\ \frac{2a}{b} & \frac{2b}{c} & 1 \end{pmatrix}.$$

Note that $a^1$ is the unique CR action in $D_1$ and $a^2$ is the unique CR action in $D_2$.
Assume that Nature uses strategy $G$ which randomizes at each stage with equal probabilities
(of $\frac{1}{3}$) amongst all 3 states. Given this strategy of Nature, the agent cannot distinguish
between the two decision problems, even if he knows Nature's strategy and he is told that
one of them is chosen. This implies that if $\mu_1$ and $\mu_2$ are the probability measures induced
by $F$ and $G$ on $(A \times S)^{\infty}$ in the decision problems $D_1$ and $D_2$ respectively, then for every
$i \in \{1, 2\}$, the distribution of the stochastic process $(N_T^i)_{T=1}^{\infty}$ with respect to $\mu_j$, $j \in \{1, 2\}$,
does not depend on $j$. That is, for every $T \ge 1$

$$\mathrm{Prob}_{\mu_1}\left(N_t^i \in M_t \text{ for all } t \le T\right) = \mathrm{Prob}_{\mu_2}\left(N_t^i \in M_t \text{ for all } t \le T\right), \quad i \in \{1, 2\} \tag{8}$$

for every sequence $(M_t)_{t=1}^{T}$ with $M_t \subseteq \{0, 1, \ldots, t\}$ for all $1 \le t \le T$.

We do not give a complete proof of (8); rather, we illustrate it by proving a representative
case. The reader can easily derive the complete proof. We show that

$$\mathrm{Prob}_{\mu_1}(N_2^1 = 0) = \mathrm{Prob}_{\mu_2}(N_2^1 = 0). \tag{9}$$

Indeed, for $j = 1, 2$,

$$\mathrm{Prob}_{\mu_j}(N_2^1 = 0) = \frac{1}{3} \sum_{k=1}^{3} F(e)(a^2)\, F\big(a^2, u_j(a^2, s^k)\big)(a^2). \tag{10}$$

Let $\pi : \{1, 2, 3\} \to \{1, 2, 3\}$ be defined by $\pi(1) = 3$, $\pi(2) = 1$, and $\pi(3) = 2$. Then

$$u_1(a^2, s^k) = u_2(a^2, s^{\pi(k)})$$

for every $1 \le k \le 3$. Therefore (10) implies (9).

As $F$ is $\delta$-optimal, there exists an integer $K$ such that with a probability of at least
$1 - \delta$ with respect to $\mu_1$, and hence with respect to $\mu_2$, $N_T^1 \ge (1 - \delta)T$ for every $T \ge K$.
This implies that with a probability of at least $1 - \delta$, $a^1$ is played at least at $1 - \delta$ of the
stages up to time $T$, for all $T \ge K$, and in particular for $T = K$. We choose the integer $K$
to be sufficiently large such that according to the Law of Large Numbers, Nature chooses
$s^3$ in at least $\frac{1}{3} - \delta$ of the stages up to stage $K$ with a probability of at least $1 - \delta$. Let $CR_2$
and $c_2$ denote $CR$ and $c$ of decision problem 2, respectively. Then

$$\frac{a}{2c} > CR_2 = \max\left(\frac{2a}{b}, \frac{2b}{c}\right).$$

Therefore, if $A_t = a^1$, then $c_2(A_t, S_t) \le CR_2$ if and only if $S_t \ne s^3$. Hence, with a
probability of at least $1 - 2\delta$, in at most $\delta + (1 - \delta)(\frac{2}{3} + \delta)$ of these stages $c_2(A_t, S_t) \le CR_2$.
Therefore $F$ cannot be $\delta$-optimal, if we choose $\delta_0$ such that $2\delta_0 < 1 - \delta_0$ and

$$\delta_0 + (1 - \delta_0)\left(\frac{2}{3} + \delta_0\right) < 1 - \delta_0.$$

□ 
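Two ingredients of Example 2 lend themselves to a quick check: the fact that the observed rewards are identically distributed in $D_1$ and $D_2$ when Nature randomizes uniformly, and the existence of a $\delta_0$ meeting the two closing conditions. The sketch below assumes those conditions read $2\delta_0 < 1 - \delta_0$ and $\delta_0 + (1 - \delta_0)(\frac{2}{3} + \delta_0) < 1 - \delta_0$; the numeric values of `a`, `b`, `c` and all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

a, b, c = 21, 5, 1          # hypothetical values with a > 4b > 16c
assert a > 4 * b > 16 * c

D1 = [[2 * a, 2 * b, 2 * c], [a, b, c]]
D2 = [[2 * a, 2 * b, 2 * c], [b, c, a]]

def reward_distribution(u, action):
    """Distribution of the observed reward when Nature picks each state w.p. 1/3."""
    counts = Counter(u[action])
    return {reward: Fraction(k, 3) for reward, k in counts.items()}

# Under imperfect monitoring the agent sees only rewards; for each action these
# have the same distribution in D1 and in D2.
same_rewards = all(
    reward_distribution(D1, i) == reward_distribution(D2, i) for i in range(2)
)

def delta0_ok(d):
    """Check the two conditions on delta_0 as read above (an assumption of this sketch)."""
    return 2 * d < 1 - d and d + (1 - d) * (Fraction(2, 3) + d) < 1 - d
```

For instance, under this reading $\delta_0 = \frac{1}{10}$ satisfies both conditions while $\delta_0 = \frac{1}{5}$ does not.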



5. Safety Level 

For the sake of comparison we discuss in this section the safety level (known also as maxmin)
criterion. Let $D = \langle A, S, u \rangle$ be a decision problem. Denote

$$v = \max_{a \in A} \min_{s \in S} u(a, s).$$

$v$ is called the safety level of the agent (or the maxmin value). Every action $a$ for which
$u(a, s) \ge v$ for every $s$ is called a safety level action. We consider now the imperfect
monitoring model for the dynamic decision problem $\langle A, S \rangle$. Every sequence of states
$z = (s_t)_{t=1}^{\infty}$ with $s_t \in S$ for every $t \ge 1$ and every pure strategy $f$ of the agent induce
a sequence of actions $(a_t)_{t=1}^{\infty}$ and a corresponding sequence of payoffs $(u_t^{z,f})_{t=1}^{\infty}$, where
$u_t^{z,f} = u(a_t, s_t)$ for every $t \ge 1$. Let $N_T^{z,f}$ denote the number of stages up to stage $T$ at
which the agent's payoff is at least the safety level $v$. That is,

$$N_T^{z,f} = \#\{1 \le t \le T : u_t^{z,f} \ge v\}. \tag{11}$$

We say that $f$ is safety level optimal if for every decision problem and for every sequence of
states,

$$\lim_{T \to \infty} \frac{1}{T} N_T^{z,f} = 1,$$

and the convergence holds uniformly w.r.t. all payoff functions $u$ and all sequences of states
in $S$. That is, for every $\delta > 0$ there exists $K = K(\delta)$ such that $N_T^{z,f} \ge (1 - \delta)T$ for every
$T \ge K$, for every decision problem $\langle A, S, u \rangle$ and for every sequence of states $z$.

Proposition 5.1: Every dynamic decision problem possesses a safety level optimal strategy
in the imperfect monitoring case, and consequently in the perfect monitoring case. Moreover,
such an optimal strategy can be chosen to be strongly efficient in the sense that for every
sequence of states there exist at most $n_A - 1$ stages at which the agent receives a payoff
smaller than his safety level, where $n_A$ denotes the number of actions.

Proof: Let $n = n_A$. Define a strategy $f$ as follows: Play each of the actions in
$A$ in the first $n$ stages. For every $T \ge n + 1$, and for every history $h = h_{T-1} =
((a_1, u_1), (a_2, u_2), \ldots, (a_{T-1}, u_{T-1}))$ we define $f(h) \in A$ as follows: For $a \in A$, let $v_T(a) =
\min u_t$, where the min ranges over all stages $t \le T - 1$ for which $a_t = a$. Define $f(h) = a_T$,
where $a_T$ maximizes $v_T(a)$ over $a \in A$. It is obvious that for every sequence of states
$z = (s_t)_{t=1}^{\infty}$ there are at most $n - 1$ stages at which $u(a_t, s_t) < v$, where $(a_t)_{t=1}^{\infty}$ is the
sequence of actions generated by $f$ and the sequence of states. Hence $N_T^{z,f} \ge T - n$, where
$N_T^{z,f}$ is defined in (11). Thus for $K(\delta) = \frac{n}{\delta}$, $\frac{1}{T} N_T^{z,f} \ge 1 - \delta$ for every $T \ge K(\delta)$.

□ 
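The strategy constructed in the proof of Proposition 5.1 is simple enough to simulate. The following Python sketch plays it against a fixed state sequence using only the agent's own actions and payoffs; the payoff matrix, the state sequence and the function names are hypothetical, and ties between actions are broken by lowest index, which the proof leaves unspecified.

```python
import random

def safety_level(u):
    """v = max over actions of the min over states of u[a][s]."""
    return max(min(row) for row in u)

def run_safety_strategy(u, states):
    """Play the strategy from the proof against a sequence of state indices.

    Only past actions and payoffs are used (imperfect monitoring).
    Returns the list of payoffs received.
    """
    n = len(u)
    worst_seen = [None] * n      # v_T(a): smallest payoff observed so far for action a
    payoffs = []
    for t, s in enumerate(states):
        if t < n:
            action = t           # first n stages: try every action once
        else:
            action = max(range(n), key=lambda a: worst_seen[a])
        r = u[action][s]
        if worst_seen[action] is None or r < worst_seen[action]:
            worst_seen[action] = r
        payoffs.append(r)
    return payoffs

# Hypothetical 3-action, 4-state payoff matrix and an arbitrary state sequence.
u = [[5, 1, 7, 2], [3, 4, 3, 6], [9, 0, 8, 1]]
rng = random.Random(0)
states = [rng.randrange(4) for _ in range(200)]
v = safety_level(u)
bad_stages = sum(r < v for r in run_safety_strategy(u, states))
assert bad_stages <= len(u) - 1
```

Each action that ever yields a payoff below $v$ is never chosen again after the first $n$ stages, because a safety level action always keeps an observed minimum of at least $v$; this is what limits the number of bad stages to $n - 1$.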

6. Discussion 

Note that all the notations established in Section 5, and Proposition 5.1 as well, remain
unchanged if we assume that the utility function $u$ takes values in a totally pre-ordered
set without any group structure. That is, our approach to decision making is qualitative
(or ordinal). This distinguishes our work from previous work on non-Bayesian repeated
games, which used the probabilistic safety level criterion as a basic solution concept for the
one shot game (see footnote 15). These works, including (Blackwell, 1956; Hannan, 1957; Banos, 1968;
Megiddo, 1980), and more recently (Auer, Cesa-Bianchi, Freund, & Schapire, 1995; Hart
& Mas-Colell, 1997), used several versions of long-run solution concepts, all based on some
optimization of the average of the utility values over time. That is, in all of these papers
the goal is to find strategies that guarantee that with high probability $\frac{1}{T} \sum_{t=1}^{T} u(A_t, S_t)$ will
be close to $v_p$.

Our work is, to the best of our knowledge, the first to introduce an efficient dynamic
optimal policy for the basic competitive ratio context. Our study and results in sections 2-4
can be easily adapted to the case of qualitative competitive ratio as well. In this approach,
the utility function takes values in some totally pre-ordered set $G$ and in addition we assume
a regret function, $\psi$, that maps $G \times G$ to some pre-ordered set $H$. For $g_1, g_2 \in G$, $\psi(g_1, g_2)$
is the level of regret when the agent receives the utility level $g_1$ rather than $g_2$. Given an
action $a$ and a state $s$, the regret function will determine the maximal regret, $c(a, s) \in H$,
of the agent when action $a$ is performed in state $s$. That is,

$$c(a, s) = \max_{b} \psi(u(a, s), u(b, s)),$$

where $b$ ranges over all actions.

The qualitative regret of action a will be the maximal regret of this action over all states. 
The optimal qualitative competitive ratio will be obtained by using an action for which the 
qualitative regret is minimal. Notice that no arithmetic calculations are needed (or make 
any sense) for this qualitative version. Our results can be adapted to the case of qualitative 
competitive ratio. For ease of exposition, however, we used the quantitative version of this 
model, where a numerical utility function represents the regret function. 
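The qualitative version described above can be illustrated with purely ordinal data. In the sketch below the scales `G` and `H`, the regret table `PSI`, the payoff labels and the function names are all hypothetical; list positions are used only to compare labels according to their order, never as quantities to be added or divided.

```python
# Ordinal scales: only the order of the labels matters.
G = ["poor", "fair", "good"]        # utility levels, worst to best (hypothetical)
H = ["none", "mild", "severe"]      # regret levels, least to most (hypothetical)

# Hypothetical regret function psi(g1, g2): regret of receiving g1 rather than g2.
PSI = {
    ("poor", "poor"): "none", ("poor", "fair"): "mild", ("poor", "good"): "severe",
    ("fair", "poor"): "none", ("fair", "fair"): "none", ("fair", "good"): "mild",
    ("good", "poor"): "none", ("good", "fair"): "none", ("good", "good"): "none",
}

def qualitative_regret_matrix(u):
    """c(a, s) = max over b of psi(u(a, s), u(b, s)), with max taken in the order of H."""
    actions, states = range(len(u)), range(len(u[0]))
    return [[max((PSI[(u[a][s], u[b][s])] for b in actions), key=H.index)
             for s in states] for a in actions]

def best_actions(u):
    """Actions whose maximal regret over all states is minimal in the order of H."""
    c = qualitative_regret_matrix(u)
    worst = [max(row, key=H.index) for row in c]
    least = min(worst, key=H.index)
    return [a for a, w in enumerate(worst) if w == least]

u = [["good", "poor"],   # action 0 in states 0 and 1
     ["fair", "good"]]   # action 1 in states 0 and 1
regrets = qualitative_regret_matrix(u)
chosen = best_actions(u)
```

Here action 0 risks a severe regret in state 1 while action 1 risks at most a mild one, so action 1 attains the minimal qualitative regret.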

15. The probabilistic safety value, $v_p$, of the agent in the decision problem $D = \langle A, S, u \rangle$ is his maxmin
value when the max ranges over all mixed actions. That is

$$v_p = \max_{q \in \Delta(A)} \min_{s \in S} \sum_{a \in A} u(a, s)\, q(a),$$

where $\Delta(A)$ is the set of all probability distributions $q$ on $A$.






MONDERER AND TENNENHOLTZ 



Our work is relevant to research on reinforcement learning in AI. Work in this area, 
however, has dealt mostly with Bayesian models. This makes our work complementary to 
this work. We wish now to briefly discuss some of the connections and differences between 
our work and existing work in reinforcement learning. 

The usual underlying structure in the reinforcement learning literature is that of an
environment which changes as a result of an agent's action based on a particular probabilistic
function. The agent's reward may be probabilistic as well (see footnote 16). In our notation, a Markov
decision process (MDP) is a repeated game against Nature with complete information and
strategic certainty, in which Nature's strategy depends probabilistically on the last action
and state chosen (see footnote 17). Standard partially observable MDP (POMDP) can be described
similarly by introducing a level of monitoring in between perfect and imperfect monitoring. In
addition, bandit problems can be basically modeled as repeated games against Nature with
a probabilistic structural assumption about Nature's strategy, but with strategic
uncertainty about the values of the transition probabilities. For example, Nature's action can
play the role of the state of a slot machine in a basic bandit problem. The main
difference between the classical problem and the problem discussed in our setting is that the
state of the slot machine may change in our setting in a totally unpredictable manner (e.g.,
the seed of the machine is manually changed at each iteration). This is not to say that
by solving our learning problem we have solved the problem of reinforcement learning in
MDP, in POMDP, or in bandit problems. In the latter settings, our optimal strategy behaves
poorly relative to strategies obtained in the theory of reinforcement learning, which take the
particular structure into account.

The non-Bayesian and qualitative setup calls for optimality criteria which differ from
the ones used in current work in reinforcement learning. Work in reinforcement learning
discusses learning mechanisms that optimize the expected payoff in the long run. In a
qualitative setting (as described above) long-run expected payoffs may not make much
sense. Our optimality criterion expresses the need to obtain a desired behavior in most
stages. One can easily construct examples where one of these approaches is preferable to the
other. Our emphasis is on obtaining the desired behavior in a relatively short run.
Although most analytical results in reinforcement learning have been concerned only with
eventual convergence to desired behavior, some of the policies have been shown to be quite
efficient in practice.
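A toy comparison makes the difference between the two kinds of criteria concrete. The two payoff streams and the target level below are invented for illustration only and do not come from any policy discussed in the paper.

```python
def average_payoff(payoffs):
    """Long-run average criterion: mean payoff over all stages."""
    return sum(payoffs) / len(payoffs)

def fraction_at_least(payoffs, threshold):
    """'Most stages' criterion: fraction of stages whose payoff reaches the threshold."""
    return sum(p >= threshold for p in payoffs) / len(payoffs)

threshold = 1                                               # hypothetical target level
spiky = [100 if t % 10 == 0 else 0 for t in range(1000)]    # rare large payoffs
steady = [1] * 1000                                         # target met at every stage

spiky_summary = (average_payoff(spiky), fraction_at_least(spiky, threshold))
steady_summary = (average_payoff(steady), fraction_at_least(steady, threshold))
```

The spiky stream has the higher average but meets the target in only a tenth of the stages, whereas the steady stream has the lower average and meets the target at every stage, so the two criteria rank the streams in opposite ways.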

In addition to the previously mentioned differences between our work and work in
reinforcement learning, we wish to emphasize that much work on POMDP uses information
structures which are different from those discussed in this paper. Work on POMDP usually
assumes that some observations about the current state may be available (following the
presentation by Smallwood & Sondik, 1973), although observations about the previous state
are discussed as well (Boutilier & Poole, 1996). Recall that in the case of perfect monitoring
the previous environment state is revealed, and the immediate reward is revealed in both
perfect and imperfect monitoring. It may be useful to consider also situations where some
(partial) observations about the previous state or the current state are revealed from time
to time. How this may be used in our setting is not completely clear, and may serve as a
subject for future research.

16. The results presented in this paper can be extended to the case where there is some randomness in the
reward obtained by the agents as well.

17. Likewise, stochastic games (Shapley, 1953) can be considered as repeated games against Nature with
partial information about Nature's strategy. For that matter one should redefine the concept of state in
such games. A state is a pair $(s, a)$, where $s$ is a state of the system and $a$ is an action of the opponent.